{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "bf083fcd-a1a9-42c0-b02e-ea871965d690",
   "metadata": {},
   "source": [
    "# ColBERT v2, PLAID Classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "79bd0ae1-73c9-4be7-9f1b-18d6aa01af08",
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastrag.stores import PLAIDDocumentStore\n",
    "import fastrag, torch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee219c84-6fc8-48d9-81c0-d492fa3424f4",
   "metadata": {},
   "source": [
    "## Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "efff8cba-f066-4e8d-b91e-facc009541e6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Dec 05, 20:11:21] #> Loading collection...\n",
      "0M \n",
      "[Dec 05, 20:11:24] #> Loading codec...\n",
      "[Dec 05, 20:11:24] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n",
      "[Dec 05, 20:11:25] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n",
      "[Dec 05, 20:11:25] #> Loading IVF...\n",
      "[Dec 05, 20:11:25] #> Loading doclens...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 666.08it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Dec 05, 20:11:25] #> Loading codes and residuals...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 15.11it/s]\n"
     ]
    }
   ],
   "source": [
    "store = PLAIDDocumentStore(index_path=\"/path/to/index\",\n",
    "                           checkpoint_path=\"Intel/ColBERT-NQ\",\n",
    "                           collection_path=\"/path/to/collection.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "18440c7a-8668-4897-907e-f7b06ae8a0cb",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Document: {'content': 'located close to the European Atlantic coast, in the southwest of France and in the north of the Aquitaine region. It is around southwest of Paris. The city is built on a bend of the river Garonne, and is divided into two parts: the right bank to the east and left bank in the west. Historically the left bank is more developed because when flowing outside the bend, the water makes a furrow of the required depth to allow the passing of merchant ships, which used to offload on this side of the river. But, today, the right bank is', 'content_type': 'text', 'score': 14.8046875, 'meta': {'title': 'Bordeaux'}, 'embedding': None, 'id': '48428'}>,\n",
       " <Document: {'content': 'which took place in Paris from April to October in 1925. This was officially sponsored by the French government, and covered a site in Paris of 55 acres, running from the Grand Palais on the right bank to Les Invalides on the left bank, and along the banks of the Seine. The Grand Palais, the largest hall in the city, was filled with exhibits of decorative arts from the participating countries. There were 15,000 exhibitors from twenty different countries, including England, Italy, Spain, Poland, Czechoslovakia, Belgium, Japan, and the new Soviet Union, though Germany was not invited because of tensions', 'content_type': 'text', 'score': 13.4609375, 'meta': {'title': 'Art Deco'}, 'embedding': None, 'id': '18726'}>,\n",
       " <Document: {'content': 'the golden age of Bordeaux. Many downtown buildings (about 5,000), including those on the quays, are from this period. Victor Hugo found the town so beautiful he once said: \"Take Versailles, add Antwerp, and you have Bordeaux\". Baron Haussmann, a long-time prefect of Bordeaux, used Bordeaux\\'s 18th-century large-scale rebuilding as a model when he was asked by Emperor Napoleon III to transform a then still quasi-medieval Paris into a \"modern\" capital that would make France proud. In 1814, towards the end of the Peninsula war, the Duke of Wellington sent William Beresford with two divisions, who seized Bordeaux without much', 'content_type': 'text', 'score': 13.3671875, 'meta': {'title': 'Bordeaux'}, 'embedding': None, 'id': '48425'}>]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "store.query(\"Where is Paris?\",top_k = 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f535a951-8017-4223-9467-3d11326e799f",
   "metadata": {},
   "source": [
    "# Retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8e54ddaa-64af-4653-816c-ce002ae1dcba",
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastrag.retrievers.colbert import ColBERTRetriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ad8ce7f8-7923-4837-be19-bfa096f9ed2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = ColBERTRetriever(store)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7ffeafee-5883-47f9-87a7-6a5b80ef634c",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Document: {'content': \"intelligence. Machine learning, a fundamental concept of AI research since the field's inception, is the study of computer algorithms that improve automatically through experience. Unsupervised learning is the ability to find patterns in a stream of input, without requiring a human to label the inputs first. Supervised learning includes both classification and numerical regression, which requires a human to label the input data first. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression is the attempt to produce a function that describes the relationship between inputs and\", 'content_type': 'text', 'score': 17.21875, 'meta': {'title': 'Artificial intelligence'}, 'embedding': None, 'id': '11791'}>,\n",
       " <Document: {'content': 'as biological ones, by using methods of supervised and unsupervised learning, regression, detection of clusters and association rule mining, among others. To indicate some of them, self-organizing maps and \"k\"-means are examples of cluster algorithms; neural networks implementation and support vector machines models are examples of common machine learning algorithms. Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientist is important to perform an experiment correctly, going from planning, passing through data generation and analysis, and ending with biological interpretation of the results. On the other hand, the advent of modern computer technology and relatively cheap computing resources have', 'content_type': 'text', 'score': 17.21875, 'meta': {'title': 'Biostatistics'}, 'embedding': None, 'id': '42375'}>,\n",
       " <Document: {'content': 'learning approaches. The decision tree is perhaps the most widely used machine learning algorithm. Other widely used classifiers are the neural network, k-nearest neighbor algorithm, kernel methods such as the support vector machine (SVM), Gaussian mixture model, and the extremely popular naive Bayes classifier. Classifier performance depends greatly on the characteristics of the data to be classified, such as the dataset size, distribution of samples across classes, the dimensionality, and the level of noise. Model-based classifiers perform well if the assumed model is an extremely good fit for the actual data. Otherwise, if no matching model is available, and if', 'content_type': 'text', 'score': 16.796875, 'meta': {'title': 'Artificial intelligence'}, 'embedding': None, 'id': '11826'}>]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.retrieve(\"What is Machine Learning?\", 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ca82086-4731-4ff1-8933-15e41ca862a4",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe75ef99-5a13-4aae-9c4e-1f7c0b222bcd",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "fc0ea21a-cf73-462c-934d-a489b49f97d1",
   "metadata": {},
   "source": [
    "# Haystack Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "c47e6573-bc98-48b1-a06e-3d679ba44492",
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack import Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "7b768689-10fd-4556-959c-4ab0893769cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "p = Pipeline()\n",
    "p.add_node(component=retriever, name=\"Retriever\", inputs=[\"Query\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b891e9d9-ab9c-4e84-a94a-d55c79926d75",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Document: {'content': 'his Prague stay, he wrote 11 scientific works, five of them on radiation mathematics and on the quantum theory of solids. In July 1912, he returned to his alma mater in Zürich. From 1912 until 1914, he was professor of theoretical physics at the ETH Zurich, where he taught analytical mechanics and thermodynamics. He also studied continuum mechanics, the molecular theory of heat, and the problem of gravitation, on which he worked with mathematician and friend Marcel Grossmann. On 3 July 1913, he was voted for membership in the Prussian Academy of Sciences in Berlin. Max Planck and Walther Nernst', 'content_type': 'text', 'score': 20.828125, 'meta': {'title': 'Albert Einstein'}, 'embedding': None, 'id': '2081'}>,\n",
       " <Document: {'content': 'technology\". Much of his work at the patent office related to questions about transmission of electric signals and electrical–mechanical synchronization of time, two technical problems that show up conspicuously in the thought experiments that eventually led Einstein to his radical conclusions about the nature of light and the fundamental connection between space and time. With a few friends he had met in Bern, Einstein started a small discussion group in 1902, self-mockingly named \"The Olympia Academy\", which met regularly to discuss science and philosophy. Their readings included the works of Henri Poincaré, Ernst Mach, and David Hume, which influenced his', 'content_type': 'text', 'score': 20.671875, 'meta': {'title': 'Albert Einstein'}, 'embedding': None, 'id': '2078'}>,\n",
       " <Document: {'content': 'level years ahead of his peers. The twelve year old Einstein taught himself algebra and Euclidean geometry over a single summer. Einstein also independently discovered his own original proof of the Pythagorean theorem at age 12. A family tutor Max Talmud says that after he had given the 12 year old Einstein a geometry textbook, after a short time \"[Einstein] had worked through the whole book. He thereupon devoted himself to higher mathematics... Soon the flight of his mathematical genius was so high I could not follow.\" His passion for geometry and algebra led the twelve year old to become', 'content_type': 'text', 'score': 20.578125, 'meta': {'title': 'Albert Einstein'}, 'embedding': None, 'id': '2069'}>]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = p.run(query=\"What did Einstein work on?\")\n",
    "res[\"documents\"][:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e0d9dd3-ba39-4b2e-b0b6-d3080c2d8979",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "027efff8-f345-4ec9-87fa-e4c048f83cf6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "be756852-8a40-482c-b7e5-ca16b3d95c3f",
   "metadata": {},
   "source": [
    "## Pipeline from YAML\n",
    "```yaml\n",
    "version: 1.12.2\n",
    "\n",
    "components:\n",
    "- name: Store\n",
    "  params:\n",
    "    checkpoint_path: Intel/ColBERT-NQ\n",
    "    collection_path: /path/to/collection.tsv\n",
    "    index_path: /path/to/index\n",
    "  type: PLAIDDocumentStore\n",
    "- name: Retriever\n",
    "  params:\n",
    "    document_store: Store\n",
    "    top_k: 10\n",
    "    use_gpu: false\n",
    "  type: ColBERTRetriever\n",
    "\n",
    "pipelines:\n",
    "- name: my_pipeline\n",
    "  nodes:\n",
    "  - inputs:\n",
    "    - Query\n",
    "    name: Retriever\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "60d9e76a-d806-4976-8f6a-7723889ca701",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Dec 05, 20:12:49] #> Loading collection...\n",
      "0M \n",
      "[Dec 05, 20:12:51] #> Loading codec...\n",
      "[Dec 05, 20:12:51] #> Loading IVF...\n",
      "[Dec 05, 20:12:51] #> Loading doclens...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1044.27it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Dec 05, 20:12:51] #> Loading codes and residuals...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 22.42it/s]\n"
     ]
    }
   ],
   "source": [
    "pipeline = Pipeline.load_from_yaml(\"plaid-colbert-pipeline.yaml\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "009d8264-b493-411b-bf95-11bf6e4221ed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Document: {'content': 'his Prague stay, he wrote 11 scientific works, five of them on radiation mathematics and on the quantum theory of solids. In July 1912, he returned to his alma mater in Zürich. From 1912 until 1914, he was professor of theoretical physics at the ETH Zurich, where he taught analytical mechanics and thermodynamics. He also studied continuum mechanics, the molecular theory of heat, and the problem of gravitation, on which he worked with mathematician and friend Marcel Grossmann. On 3 July 1913, he was voted for membership in the Prussian Academy of Sciences in Berlin. Max Planck and Walther Nernst', 'content_type': 'text', 'score': 20.828125, 'meta': {'title': 'Albert Einstein'}, 'embedding': None, 'id': '2081'}>,\n",
       " <Document: {'content': 'technology\". Much of his work at the patent office related to questions about transmission of electric signals and electrical–mechanical synchronization of time, two technical problems that show up conspicuously in the thought experiments that eventually led Einstein to his radical conclusions about the nature of light and the fundamental connection between space and time. With a few friends he had met in Bern, Einstein started a small discussion group in 1902, self-mockingly named \"The Olympia Academy\", which met regularly to discuss science and philosophy. Their readings included the works of Henri Poincaré, Ernst Mach, and David Hume, which influenced his', 'content_type': 'text', 'score': 20.671875, 'meta': {'title': 'Albert Einstein'}, 'embedding': None, 'id': '2078'}>]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline.run(query=\"What did Einstein work on?\")[\"documents\"][:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8be3ae2d-b325-42b6-8fad-a2aba86984da",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "628a47d6-3556-4aee-b8ca-9cce921191e9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70752dcb-d7e2-4066-9d44-0f870dc9f437",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "df5ae5c4-c275-4125-9bf0-06b13959786e",
   "metadata": {},
   "source": [
    "------"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a43a69a-5f08-469b-aa24-f556df696de0",
   "metadata": {},
   "source": [
    "# Index Creation\n",
    "\n",
    "1. Install the package and go to `libs/colbert`. \n",
    "2. Install it using Anaconda/Miniconda, either GPU or CPU using the provided yaml files. \n",
    "3. Make sure you have RTX GPUs or better (RTX 3090, V100, etc.)\n",
    "4. Use our own `scripts/indexing/create_plaid.py` script (fastRAG). \n",
    "\n",
    "Example:\n",
    "```sh\n",
    "python scripts/indexing/create_plaid.py            \\\n",
    "         --checkpoint /path/to/checkpoint          \\\n",
    "         --collection /path/to/collection.tsv      \\\n",
    "         --index-save-path /path/to/index          \\\n",
    "         --gpu 2 --ranks 2\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ff7b9e2-1a22-4cfd-8198-1e219f18e088",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d676bf16-31e3-4e6b-8019-9f6c5cc685db",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5698e1b-3c39-4ead-8cc3-ecf2464e5778",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "01262dede09baa68616418263efd26d33bafc508f82c218954990f624836b45d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
